Decomposing scRNA-seq data using NMF - a demo
Non-negative matrix factorization is a tool for the analysis of high dimensional data that allows extracting sparse and meaningful features from a set of non-negative data vectors. It is well suited for decomposing scRNA-seq data, effectively reducing large complex matrices (\(10^4\) of genes times \(10^5\) of cells) into a few interpretable gene programs. It has been especially used to extract recurrent gene programs in cancer cells (see e.g. Barkely et al. (2022) and Gavish et al. (2023)), which are otherwise difficult to integrated and analyse jointly.
Here, to illustrate the methods implemented in the GeneNMF package, we will apply NMF on a single-cell cell dataset of human PBMCs - a downsampled version of the dataset published by Hao et al. (2021).
Set up the enviroment
Here are some packages you’ll need for this demo:
library(Seurat)
library(ggplot2)
library(UCell)
library(patchwork)
library(Matrix)
library(RcppML)
library(dendextend)
library(viridis)
remotes::install_github("carmonalab/GeneNMF")
library(GeneNMF)Then download the test dataset for this demo.
ddir <- "input"
data.path <- sprintf("%s/pbmc_multimodal.downsampled20k.seurat.rds", ddir)
if (!file.exists(data.path)) {
dir.create(ddir)
dataUrl <- "https://www.dropbox.com/s/akzu3hp4uz2mpkv/pbmc_multimodal.downsampled20k.seurat.rds?dl=1"
download.file(dataUrl, data.path)
}
seu <- readRDS(data.path)NMF for dimensionality reduction
NMF can be applied to reduce the dimensionality of the data from tens
of thousand of genes to a few dimensions (similarly to PCA). With the
RunNMF() function, it can be directly applied on a Seurat
object, and it will save the NMF results as a new dimensionality
reduction.
ndim <- 15
seu <- FindVariableFeatures(seu, nfeatures = 1000)
seu <- RunNMF(seu, k = ndim, assay="SCT")## A dimensional reduction object with key NMF_
## Number of dimensions: 15
## Number of cells: 20000
## Projected dimensional reduction calculated: FALSE
## Jackstraw run: FALSE
## Computed using assay: SCT
We can also further reduced the dimensionality to 2 dimensions using UMAP; in this space we can visualize all cells in a single plot.
seu <- RunUMAP(seu, reduction = "NMF", dims=1:ndim, reduction.name = "NMF_UMAP", reduction.key = "nmfUMAP_")DimPlot(seu, reduction = "NMF_UMAP", group.by = "celltype.l1", label=T) + theme(aspect.ratio = 1,
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank()) + ggtitle("NMF UMAP") + NoLegend()Consistent NMF programs across multiple samples
Identification of robust gene programs requires their detection
across samples and variability of input parameters. Perhaps the most
crucial parameter to NMF is the dimensionality k, which
corresponds to the number of programs of the low-dimensional matrix. To
determine robust programs, we can run NMF over multiple numbers of
k and determine programs that are consistenly found across
these runs. The multiNMF() function automatically performs
NMF over a list of samples and for multiple values of
k:
seu.list <- SplitObject(seu, split.by = "donor")
nmf.res <- multiNMF(seu.list, assay="SCT", slot="data", k=4:9, L1=c(0,0),
do_centering=TRUE, calculate_hvg = TRUE, nfeatures = 2000,
exclude_ribo_mito=FALSE)We can now compare the gene programs identified over multiple samples
and numbers of k and inspect their similarity. It can be
useful to visualize pairwise similarity (Jaccard Index) between gene
programs:
ph <- getNMFheatmap(nmf.res, method=0.5, max.genes=50,
hclust.method="ward.D2",
jaccard.cutoff = c(0,0.8))We can see “blocks” corresponding to gene programs of high similarity
across datasets and values of k. We can then cut the
similarity tree at a given height to find blocks of similar programs and
derive consensus gene signatures for each block. For example, here we
cut the tree to the height corresponding to 10 clusters of programs
(which we can call meta-programs or MPs):
nprograms = 10
nmf.genes <- getNMFgenesConsensus(nmf.res, method=0.5, max.genes=50,
hclust.method="ward.D2", nprograms=nprograms,
min.confidence=0.3,
plot.tree=TRUE, return.tree=FALSE)We can also extract the “coverage” for consensus meta-programs; i.e. in what fraction of samples it was detected. On this dataset, all programs except Programs 8 and 9 were found in all samples.
nmf.cov <- getNMFgenesConsensus(nmf.res, method=0.5, max.genes=50,
hclust.method="ward.D2", nprograms=nprograms,
min.confidence=0.3, plot.tree = FALSE,
return.coverage = TRUE)
t(as.data.frame(nmf.cov))## [,1]
## Program1 1.000
## Program2 1.000
## Program3 1.000
## Program4 1.000
## Program5 1.000
## Program6 1.000
## Program7 1.000
## Program8 0.875
## Program9 0.875
## Program10 1.000
What are the genes driving each program?
## [,1] [,2] [,3] [,4] [,5] [,6]
## Program1 "CD14" "CXCL8" "CYP1B1" "EGR1" "FCN1" "FOS"
## Program2 "BANK1" "CD79A" "FCRL5" "IGHD" "IGHM" "LINC00926"
## Program3 "ACRBP" "CAVIN2" "CLEC1B" "PRKAR2B" "ITGA2B" "SPARC"
## Program4 "ADGRG1" "CTSW" "FGFBP2" "GNLY" "GZMB" "IL2RB"
## Program5 "CCR7" "LEF1" "LRRN3" "MAL" "PASK" "TSHZ2"
## Program6 "C1QA" "CDKN1C" "CKB" "CTSL" "HES4" "LYPD2"
## Program7 "SERPINB2" "VCAN" "PLBD1" "RBP7" "S100A12" "S100A9"
## Program8 "CLEC10A" "ENHO" "FCER1A" "PID1" "C1orf54" "CD1C"
## Program9 "CAMK2N1" "GZMK" "KLRG1" "LINC01871" "LYAR" "MAF"
## Program10 "IER3" "C15orf48" "CCL3L1" "SGK1" "CXCL8" "G0S2"
Intepretation of gene programs by GSEA
To aid the intepretation of gene programs, we can compare them to
known signatures from public databases. The runGSEA()
function can be useful to scan msigDB and evaluate the overlap of
detected gene programs with signatures in the databases. Here we compare
to the “C8” category (cell type signature gene sets); but other classes
such as “H” (hallmark gene sets) may be more relevant in other
contexts.
top_p <- lapply(nmf.genes, function(program) {
runGSEA(program, universe=rownames(seu), category = "C8")
})For example, Program 2 appears to correlate significantly with a B cell signature:
## pathway pval
## <char> <num>
## 1: DURANTE_ADULT_OLFACTORY_NEUROEPITHELIUM_DENDRITIC_CELLS 2.903967e-64
## 2: AIZARANI_LIVER_C23_KUPFFER_CELLS_3 9.833693e-50
## 3: TRAVAGLINI_LUNG_CLASSICAL_MONOCYTE_CELL 1.225900e-42
## 4: CUI_DEVELOPING_HEART_C8_MACROPHAGE 3.011105e-40
## 5: HAY_BONE_MARROW_NEUTROPHIL 5.016608e-36
## 6: RUBENSTEIN_SKELETAL_MUSCLE_MYELOID_CELLS 1.868127e-35
## padj overlap size overlapGenes
## <num> <int> <int> <list>
## 1: 2.026969e-61 33 116 AP1S2, C....
## 2: 3.431959e-47 31 216 AP1S2, A....
## 3: 2.852261e-40 29 264 AP1S2, A....
## 4: 5.254379e-38 28 271 C5AR1, C....
## 5: 7.003184e-34 29 439 AP1S2, A....
## 6: 2.173254e-33 27 341 AP1S2, C....
Signature scores for gene programs
A simple way to evaluate gene programs learned from the data is to
calculate gene signature scores with the UCell package.
FeaturePlot(seu, features=names(nmf.genes), ncol=5) & theme(aspect.ratio = 1,
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank())We can see how many of the programs are enriched in specific cell subtypes (cell type annotation from the original study).
Signature scores to define integrated space
Individual cells can now be represented in terms of their gene program scores. Importantly, here the gene programs were learned as a consensus of gene programs found across multiple samples – as opposed to calculating NMF once on the whole dataset. This can be an effective strategy to mitigate batch effects, as robust gene programs (aka metaprograms) are a consensus of gene programs consistently found across individual samples. Let’s store these coordinates in the Seurat object:
matrix <- seu@meta.data[,names(nmf.genes)]
dimred <- scale(matrix)
colnames(dimred) <- paste0("NMF_",seq(1, ncol(dimred)))
#New dim reduction
seu@reductions[["NMFsignatures"]] <- new("DimReduc",
cell.embeddings = dimred,
assay.used = "RNA",
key = "NMF_",
global = FALSE)We can also use these scores to generate a UMAP representation and visualize the data in 2D:
set.seed(123)
seu <- RunUMAP(seu, reduction="NMFsignatures", dims=1:length(seu@reductions[["NMFsignatures"]]),
metric = "euclidean", reduction.name = "umap_NMF")How do the signature scores for the consensus programs look like in the combined space?
FeaturePlot(seu, features = names(nmf.genes), reduction = "umap_NMF", ncol=4) &
scale_color_viridis(option="B") &
theme(aspect.ratio = 1, axis.text=element_blank(), axis.ticks=element_blank())Final remarks
NMF can be a powerful tool to extract gene programs for scRNA-seq data in an unbiased manner. Because it is calculated for each sample separately, it bypasses the need to perform batch effect correction to analyse samples jointly. This aspect makes it particularly interesting for the analysis of gene programs in cancer cells (see e.g. Barkely et al. (2022) and Gavish et al. (2023).